Nitrogen and Iron Availability Drive Metabolic Remodeling and Natural Selection of Diverse Phytoplankton during Experimental Upwelling

ABSTRACT Nearly half of carbon fixation and primary production originates from marine phytoplankton, and much of it occurs in episodic blooms in upwelling regimes. Here, we simulated blooms limited by nitrogen and iron by incubating Monterey Bay surface waters with subnutricline waters and inorganic nutrients and measured the whole-community transcriptomic response during mid- and late-bloom conditions. Cell counts revealed that centric and pennate diatoms (largely Pseudo-nitzschia and Chaetoceros spp.) were the major blooming taxa, but dinoflagellates, prasinophytes, and prymnesiophytes also increased. Viral mRNA significantly increased in late bloom and likely played a role in the bloom’s demise. We observed conserved shifts in the genetic similarity of phytoplankton populations to cultivated strains, indicating adaptive population-level changes in community composition. Additionally, the density of single nucleotide variants (SNVs) declined in late-bloom samples for most taxa, indicating a loss of intraspecific diversity as a result of competition and a selective sweep of adaptive alleles. We noted differences between mid- and late-bloom metabolism and differential regulation of light-harvesting complexes (LHCs) under nutrient stress. While most LHCs are diminished under nutrient stress, we showed that diverse taxa upregulated specialized, energy-dissipating LHCs in low iron. We also suggest the relative expression of NRT2 compared to the expression of GSII as a marker of cellular nitrogen status and the relative expression of iron starvation-induced protein genes (ISIP1, ISIP2, and ISIP3) compared to the expression of the thiamine biosynthesis gene (thiC) as a marker of iron status in natural diatom communities. IMPORTANCE Iron and nitrogen are the nutrients that most commonly limit phytoplankton growth in the world’s oceans. The utilization of these resources by phytoplankton sets the biomass available to marine systems and is of particular interest in high-nutrient, low-chlorophyll (HNLC) coastal fisheries. Previous research has described the biogeography of phytoplankton in HNLC regions and the transcriptional responses of representative taxa to nutrient limitation. However, the differential transcriptional responses of whole phytoplankton communities to iron and nitrogen limitation has not been previously described, nor has the selective pressure that these competitive bloom environments exert on major players. In addition to describing changes in the physiology of diverse phytoplankton, we suggest practical indicators of cellular nitrogen and iron status for future monitoring.

The mean size of the libraries was around 400 base pairs. Resulting libraries were subjected to paired-end sequencing via Illumina HiSeq.

Amplicon library preparation and sequencing
1 µg total RNA from each sample was converted to cDNA using the SuperScript-III First Strand cDNA Synthesis System with 1 µL random hexamer primers. 16S and 18S rRNA PCR amplifications were each performed using the Life Technologies AccuPrime PCR system kit in barcodes. PCR cycling conditions consisted of an initial denaturation at 95°C for 2 minutes, 30 cycles of 95°C for 20 seconds, 56°C for 30 seconds, and 72°C for 5 minutes. PCR products (3 µl of each sample and 5 µl of negative control) were run on a 1% agarose gel at 110 V for 70 minutes, cleaned up using the AMPure XP bead kit (Beckman Coulter Life Sciences, Brea CA), and resuspended in 25 µL of Qiagen elution buffer (EB). The final product was visualized for quality assessment on an agarose gel (2.5 µL) and quantified using in a LifeTechnologies' PicoGreen Quant-IT assay (1 µL). Using this quantification, 20 ng of PCR product from each sample (both 16S and 18S rRNA) was pooled for 454 pyrosequencing.

16S and 18S Amplicons
Raw reads were demultiplexed and trimmed of adapters and low-quality sequences using an in-house script. Paired-end reads were merged using PEAR (5) with the default parameters.
Merged sequences were searched for rDNA via matching for the rRNA reference covariance models using Infernal (6). Taxonomic classification of the predicted rDNA sequences was determined by conducting a blastn (7) search against the SILVA (8). A subset of the reference (SILVA) rRNA sequences was manually identified to be included in building a reference phylogenetic tree. The reference sequences were aligned with MAFFT (9). The generated multiple sequence alignment was visually inspected and manually edited and refined using JalView (10). A maximum-likelihood reference tree was inferred under the general timereversible model with gamma-distributed rate heterogeneity and an estimated proportion of invariant sites (GTR + Γ + I), implemented in RAxML (11). The predicted metagenomic rDNA sequences were mapped onto the reference tree using pplacer (12) with the default settings. The counts of the sequences affiliated with the nodes on the reference tree were normalized to the total number of sequences from their corresponding samples. The normalized abundances are visualized as circles mapped on the reference tree such that the diameters of the circles are proportionate to the taxonomic abundances.

Metatranscriptomics
Illumina reads were processed via the RNAseq Annotation Pipeline (RAP) (13) as previously described (14) (SD3). Briefly, reads were trimmed and filtered with a length minimum of 30 base pairs and a quality score minimum of 33. Ribopicker v.0.4.3 (15) was used to remove ribosomal RNA (rRNA) reads. CLC Genomics Workbench 9.5.3 (https://www.qiagenbioinformatics.com/) was used to assemble reads first by library, then overall. FragGeneScan (16) was used for ab initio ORF prediction (that is to say, gene prediction based on signal detection in the sequence itself, rather than by comparison to known genes).
ORFs were screened for contamination in the form of rRNA, ITS, and primers. Organellar ORFs (those with closer homology to a known organelle gene than that of a nuclear gene in the same reference organism) were identified for separate analysis.
Ab initio ORFs were annotated for function de novo by assigning Pfams (17) (19). Briefly, LPI was calculated here as a value between 0 and 1 indicating lineage commonality among the top 95-percentile of sequences based on BLAST bit-score.
Reads were then mapped competitively to the full suite of ab initio ORFs to generate counts for each sample. Taxonomic counts were prepared by tallying reads mapped to taxonomically annotated ORFs.

Mapping to reference transcriptomes
The 22 reference transcriptomes that most commonly appeared as best hits for ab initio ORFs were also mapped to directly. These reference transcriptomes were sourced from the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP; http://marinemicroeukaryotes.org/). Reads from the incubation experiment were aligned to both MMETSP reference contigs and ORFs with BWA-MEM(20, 21) using default parameters.

SNV detection
Single nucleotide variants (SNVs) were detected among reads mapping to ab initio ORFs to a maximum read depth of 8000 using samtools mpileup (22) with the -C50 parameter to reduce the effect of reads with excessive mismatches and -A to include anomalous read pairs, and bcftools call with the consensus calling method (-c). Libraries were subsampled to the depth of the lowest-coverage library to reduce coverage biases in SNV detection. Only SNVs with a quality score of at least 10, a root mean square mapping quality of at least 40, and at least 10 reads covering the call position were reported.
Directional edge weights were defined as the ratio of pairwise-to self-BLASTP scores, and default parameters were used to assign ORFs to clusters. Clusters were assigned a consensus annotation if found to be statistically enriched in that annotation with a Fisher's exact test (p < 0.05). Consensus annotations must also represent at least 10% of the reads in the cluster and account for a minimum of 200 reads.

Differential expression (DE) of taxa groups, reference transcriptome ORFs, ab initio
ORFs, and ortholog clusters across bloom conditions was identified using edgeR version 3.16.5 (24). Read counts were normalized using the "calcNormFactors" function, which accounts for both library size and varied library composition. This function performs trimmed mean of Mvalues (TMM) normalization to account for a common RNA-seq effect in which the oversampling of highly expressed genes falsely causes low-expression genes to appear downregulated. Tagwise dispersions were estimated using quantile-adjusted conditional maximum likelihood (qCML) method via the "estimateTagwiseDisp" function. P-values were computed using the "exactTest" function, which is analogous to Fisher's exact test but uses a negative binomial distribution, and FDR-corrected using the Benjamini-Hochberg method. DE of ortholog clusters was determined by using a Fisher's exact test (FDR < 0.05) in R to determine if each cluster was enriched in up-regulated ORFs, down-regulated ORFs or differentially expressed ORFs for a given taxa group. Manta plots of DE were created using the R package manta version 1.28.1.
For ab initio ORF DE, additional normalization strategies were implemented in order to validate the use of traditional edgeR parameters. EdgeR traditionally normalizes by distribution (TMM normalization), which assumes roughly the same number of up-and down-regulated genes across conditions. Due to the large change in biomass and physiology inherent in a bloom experiment, we sought to validate our results by normalizing to growth biomarkers including a regulator of ribosome biogenesis (Nop53), a housekeeping gene (60S ribosomal protein L7, rpl7; KOG3184) with validated stability across metazoans (25,26), plants (27) and algae (28,29), and a gene cluster that was highly correlated with cell counts (cluster 630, R > 0.9). Marker gene normalization has previously been shown to improve the accuracy of edgeR in cases when DE is not symmetric across conditions (30,31). Size factors for edgeR normalization were calculated in DEseq2 verison 1.14.1 (32) by providing the function estimateSizeFactors with the chosen markers as controlGenes. The size factors were then imported into edgeR using the norm.factors parameter of DGEList. In all cases, an exact test with tagwise dispersion estimation was used to determine ORFs with significantly different expression across size classes (FDR corrected p < 0.05). While the number of genes classified as differentially expressed varied widely across normalization strategy, gene functions with strong DE were chiefly conserved across strategies, and common markers of N and Fe stress were detected across all methods (SD4). Because these strategies yielded similar results, we chose to use the simplest, default normalization method (library normalization with "calcNormFactors") for the differential expression data presented throughout the text.

Identification of light harvesting complex proteins
Light harvesting complex (LHC) sequences were mined from the metatranscriptomic datasets using Hidden Markov Models (HMMs; available at https://github.com/allenlab/Data_Files/blob/main/LHC_hmm). HMMs were constructed and validated using references from our in-house database, phyloDB, of complete genomes and eukaryotic transcriptomes (phyloDB 1.075 available at https://scripps.ucsd.edu/labs/aallen/data/). Sequences were trimmed to 130 amino acids, aligned using MUSCLE, and reference phylogeny was constructed using Phyml. Placement of sequencing reads from each dataset was performed using pplacer, and the final tree was drawn using in-house software for phylogenetic placement predication visualization (available at https://github.com/mccrowjp/slacTree.git). Circles indicate the relative abundance of sequences at node placement, abundance is scaled to the greatest value, the 'absize' (5 here) that is a multiple used for scaling each radius size. In slacTree, final abundances are calculated using (sqrt(scaled abundance) x viewwidth x absize x 0.001).
Viewwidth is given during scalable vector graphic (svg) formatting and refers to the plot size.